{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "ab8b509f",
   "metadata": {},
   "source": [
    "\n",
    "# Exercise 1: Data Cleaning and Exploratory Data Analysis\n",
    "\n",
    "This notebook covers data cleaning, handling missing values, and basic exploratory data analysis (EDA) techniques. Dataset links are provided for practice.\n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8a9905af",
   "metadata": {},
   "source": [
    "\n",
    "## Dataset Information and Download Links\n",
    "\n",
    "This exercise uses a modified **Diabetes dataset**, which can be downloaded from the following source:\n",
    "\n",
    "1. **Kaggle:**\n",
    "   - [Diabetes Dataset - Kaggle](https://www.kaggle.com/datasets/mathchi/diabetes-data)\n",
    "   - This dataset includes patient data for diabetes-related research.\n",
    "\n",
    "### Dataset Attributes\n",
    "\n",
    "- **Pregnancies**: Number of pregnancies.\n",
    "- **Glucose**: Plasma glucose concentration.\n",
    "- **BloodPressure**: Diastolic blood pressure (mm Hg).\n",
    "- **SkinThickness**: Triceps skinfold thickness (mm).\n",
    "- **Insulin**: 2-Hour serum insulin (mu U/ml).\n",
    "- **BMI**: Body mass index (weight in kg/(height in m)^2).\n",
    "- **DiabetesPedigreeFunction**: Diabetes pedigree function.\n",
    "- **Age**: Age of the patient.\n",
    "- **Outcome**: Class variable (0 = non-diabetic, 1 = diabetic).\n",
    "\n",
    "### Usage Notes\n",
    "\n",
    "- Ensure the dataset is preprocessed (e.g., handle missing values and normalize if necessary).\n",
    "- Refer to the [dataset documentation](https://www.kaggle.com/datasets/mathchi/diabetes-data) for more information.\n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7a5de55e",
   "metadata": {},
   "source": [
    "\n",
    "## Handling Missing Values\n",
    "\n",
    "Identify and handle missing values in the dataset to ensure accuracy during analysis.\n",
    "        "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8e7a7fc1",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "import pandas as pd\n",
    "\n",
    "# Load the dataset (replace with your file path)\n",
    "data = pd.read_csv(r'C:\\Path\\to\\diabetes.csv')\n",
    "\n",
    "# Check for missing values\n",
    "print(\"Missing values before cleaning:\")\n",
    "print(data.isna().sum())\n",
    "\n",
    "# Fill missing values with median\n",
    "data = data.fillna(data.median())\n",
    "\n",
    "# Verify missing values are handled\n",
    "print(\"Missing values after cleaning:\")\n",
    "print(data.isna().sum())\n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "77496d8c",
   "metadata": {},
   "source": [
    "\n",
    "## Detecting and Removing Duplicates\n",
    "\n",
    "Duplicates can affect data integrity. This step identifies and removes duplicate rows if present.\n",
    "        "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b1fa774f",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "# Check for duplicates\n",
    "duplicate_count = data.duplicated().sum()\n",
    "print(f\"Number of duplicate rows: {duplicate_count}\")\n",
    "\n",
    "# Remove duplicates\n",
    "data = data.drop_duplicates()\n",
    "\n",
    "print(f\"Number of rows after removing duplicates: {len(data)}\")\n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "390dd1db",
   "metadata": {},
   "source": [
    "\n",
    "## Descriptive Statistics\n",
    "\n",
    "Calculate summary statistics to understand the distribution of numerical variables.\n",
    "        "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "35560602",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "# Display summary statistics\n",
    "print(data.describe())\n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "30808a02",
   "metadata": {},
   "source": [
    "\n",
    "## Visualizing Distributions\n",
    "\n",
    "Use histograms to visualize the distribution of glucose levels and other key variables.\n",
    "        "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b69149fd",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "# Plot distribution of glucose levels\n",
    "sns.histplot(data['Glucose'], kde=True)\n",
    "plt.title(\"Distribution of Glucose Levels\")\n",
    "plt.xlabel(\"Glucose\")\n",
    "plt.ylabel(\"Frequency\")\n",
    "plt.show()\n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9bf3007a",
   "metadata": {},
   "source": [
    "\n",
    "## Correlation Analysis\n",
    "\n",
    "Analyze correlations between numerical variables using a heatmap.\n",
    "        "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5667f343",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "# Compute correlation matrix\n",
    "corr_matrix = data.corr()\n",
    "\n",
    "# Plot correlation heatmap\n",
    "sns.heatmap(corr_matrix, annot=True, cmap=\"coolwarm\")\n",
    "plt.title(\"Correlation Matrix\")\n",
    "plt.show()\n",
    "        "
   ]
  }
 ],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 5
}
